By: Sushma Rao
Buying and selling used phones and tablets used to be something that happened on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market will be worth \$52.7bn by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets that offer considerable savings compared with new models.
Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses that are looking to save money when purchasing one. There are plenty of other benefits associated with the used device market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.
The data contains the different attributes of used/refurbished phones and tablets. The detailed data dictionary is given below.
Data Dictionary
(units inferred from the data)
- brand_name: name of the manufacturing brand
- os: operating system running on the device
- screen_size: size of the screen (in cm)
- 4g: whether 4G is available or not
- 5g: whether 5G is available or not
- main_camera_mp: resolution of the rear camera (in MP)
- selfie_camera_mp: resolution of the front camera (in MP)
- int_memory: amount of internal memory/ROM (in GB)
- ram: amount of RAM (in GB)
- battery: energy capacity of the battery (in mAh)
- weight: weight of the device (in grams)
- release_year: year the device model was released
- days_used: number of days the device has been used
- new_price: price of a new device of the same model (in euros)
- used_price: price of the used/refurbished device (in euros)
Loading nb_black will automatically format the Python code in this notebook (good coding practice).
%load_ext nb_black
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# split the data into train and test
from sklearn.model_selection import train_test_split
# to build linear regression_model
from sklearn.linear_model import LinearRegression
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# to build linear regression_model using statsmodels
import statsmodels.api as sm
# to compute VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# loading data set
data = pd.read_csv('used_device_data.csv')
# checking shape of the data
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns.")
There are 3454 rows and 15 columns.
# let's view a sample of the data
data.sample(n=10, random_state=1)
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 866 | Others | Android | 15.240 | no | no | 8.000 | 2.000 | 16.000 | 4.000 | 3000.000 | 206.000 | 2014 | 632 | 179.610 | 56.760 |
| 957 | Celkon | Android | 10.160 | no | no | 3.150 | 0.300 | 512.000 | 0.250 | 1400.000 | 140.000 | 2013 | 637 | 48.650 | 16.450 |
| 280 | Infinix | Android | 15.390 | yes | no | NaN | 8.000 | 32.000 | 2.000 | 5000.000 | 185.000 | 2020 | 329 | 88.880 | 79.100 |
| 2150 | Oppo | Android | 12.830 | yes | no | 13.000 | 16.000 | 64.000 | 4.000 | 3200.000 | 148.000 | 2017 | 648 | 281.300 | 107.540 |
| 93 | LG | Android | 15.290 | yes | no | 13.000 | 5.000 | 32.000 | 3.000 | 3500.000 | 179.000 | 2019 | 216 | 200.420 | 91.620 |
| 1040 | Gionee | Android | 12.830 | yes | no | 13.000 | 8.000 | 32.000 | 4.000 | 3150.000 | 166.000 | 2016 | 970 | 279.870 | 104.130 |
| 3170 | ZTE | Others | 10.160 | no | no | 3.150 | 5.000 | 16.000 | 4.000 | 1400.000 | 125.000 | 2014 | 1007 | 69.710 | 43.140 |
| 2742 | Sony | Android | 12.700 | yes | no | 20.700 | 2.000 | 16.000 | 4.000 | 3000.000 | 170.000 | 2013 | 1060 | 330.240 | 83.330 |
| 102 | Meizu | Android | 15.290 | yes | no | NaN | 20.000 | 128.000 | 6.000 | 3600.000 | 165.000 | 2019 | 332 | 420.170 | 142.510 |
| 1195 | HTC | Android | 10.290 | no | no | 8.000 | 2.000 | 32.000 | 4.000 | 2000.000 | 146.000 | 2015 | 892 | 131.500 | 68.570 |
#checking for the data types
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   brand_name        3454 non-null   object 
 1   os                3454 non-null   object 
 2   screen_size       3454 non-null   float64
 3   4g                3454 non-null   object 
 4   5g                3454 non-null   object 
 5   main_camera_mp    3275 non-null   float64
 6   selfie_camera_mp  3452 non-null   float64
 7   int_memory        3450 non-null   float64
 8   ram               3450 non-null   float64
 9   battery           3448 non-null   float64
 10  weight            3447 non-null   float64
 11  release_year      3454 non-null   int64  
 12  days_used         3454 non-null   int64  
 13  new_price         3454 non-null   float64
 14  used_price        3454 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 404.9+ KB
Observations:
1. brand_name, os, 4g, and 5g are of type object.
2. screen_size, main_camera_mp, selfie_camera_mp, int_memory, ram, battery, weight, new_price, and used_price are of type float.
3. release_year and days_used are of type integer.
4. There are clearly a few missing values.
# checking for duplicate values
data.duplicated().value_counts()
False    3454
dtype: int64
Observation: There are no duplicate rows.
# checking for missing values in the data
data.isnull().sum()
brand_name            0
os                    0
screen_size           0
4g                    0
5g                    0
main_camera_mp      179
selfie_camera_mp      2
int_memory            4
ram                   4
battery               6
weight                7
release_year          0
days_used             0
new_price             0
used_price            0
dtype: int64
Observations: main_camera_mp, selfie_camera_mp, int_memory, ram, battery, and weight have missing values. main_camera_mp has the highest number of missing values (179).
data.describe(include="all")
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3454 | 3454 | 3454.000 | 3454 | 3454 | 3275.000 | 3452.000 | 3450.000 | 3450.000 | 3448.000 | 3447.000 | 3454.000 | 3454.000 | 3454.000 | 3454.000 |
| unique | 34 | 4 | NaN | 2 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | Others | Android | NaN | yes | no | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 502 | 3214 | NaN | 2335 | 3302 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | NaN | 13.713 | NaN | NaN | 9.460 | 6.554 | 54.573 | 4.036 | 3133.403 | 182.752 | 2015.965 | 674.870 | 237.039 | 92.303 |
| std | NaN | NaN | 3.805 | NaN | NaN | 4.815 | 6.970 | 84.972 | 1.365 | 1299.683 | 88.413 | 2.298 | 248.580 | 194.303 | 54.702 |
| min | NaN | NaN | 5.080 | NaN | NaN | 0.080 | 0.000 | 0.010 | 0.020 | 500.000 | 69.000 | 2013.000 | 91.000 | 18.200 | 4.650 |
| 25% | NaN | NaN | 12.700 | NaN | NaN | 5.000 | 2.000 | 16.000 | 4.000 | 2100.000 | 142.000 | 2014.000 | 533.500 | 120.343 | 56.483 |
| 50% | NaN | NaN | 12.830 | NaN | NaN | 8.000 | 5.000 | 32.000 | 4.000 | 3000.000 | 160.000 | 2015.500 | 690.500 | 189.785 | 81.870 |
| 75% | NaN | NaN | 15.340 | NaN | NaN | 13.000 | 8.000 | 64.000 | 4.000 | 4000.000 | 185.000 | 2018.000 | 868.750 | 291.115 | 116.245 |
| max | NaN | NaN | 30.710 | NaN | NaN | 48.000 | 32.000 | 1024.000 | 12.000 | 9720.000 | 855.000 | 2020.000 | 1094.000 | 2560.200 | 749.520 |
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
        )  # histogram with the requested number of bins
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
histogram_boxplot(data, "used_price")##code to create histogram_boxplot for 'used_price'
observations:
1. The distribution of used price looks right skewed.
2. There are a few outliers.
3. The mean is greater than the median.
histogram_boxplot(data, "new_price")##code to create histogram_boxplot for 'new_price'
observations: The distribution of new price looks right skewed with outliers. The distribution looks similar to that of used price.
histogram_boxplot(data, "screen_size")##code to create histogram_boxplot for 'screen size'
Observations: The plot shows many outliers. However, the used devices can be either phones or tablets, so screen size will naturally vary widely. We will therefore keep all the data points; dropping them could hurt model performance.
histogram_boxplot(data, "selfie_camera_mp")##code to create histogram_boxplot for 'selfie camera mp'
Observations: The distribution is right skewed.
histogram_boxplot(data, "main_camera_mp")##code to create histogram_boxplot for 'main camera mp'
Observations: The distribution is right skewed.
histogram_boxplot(data, "int_memory")##code to create histogram_boxplot for 'int_memory'
Observation: The distribution is clearly right skewed.
histogram_boxplot(data, "ram")##code to create histogram_boxplot for 'ram'
Observation: Most of the devices have 4 GB of RAM. The plot also shows some outliers.
histogram_boxplot(data, "weight")##code to create histogram_boxplot for 'weight'
Observations: The distribution is positively skewed. The heaviest devices are likely used tablets, and they are few in number.
histogram_boxplot(data, "battery")##code to create histogram_boxplot for 'battery'
Observations: Most devices have a battery capacity below 4000 mAh. The plot clearly shows outliers.
histogram_boxplot(data, "days_used")##code to create histogram_boxplot for 'days_used'
Observations: The distribution looks left skewed.
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
labeled_barplot(data, "os", perc=True, n=10) ## code to create labeled_barplot for 'os'
data['os'].value_counts()
Android    3214
Others      137
Windows      67
iOS          36
Name: os, dtype: int64
observations:
1. The labeled bar chart shows that the used device market is clearly dominated by Android.
2. 93.1% of the used devices run Android.
labeled_barplot(data, "brand_name", perc=True, n=34)## code to create labeled_barplot for 'brand name'
Observations: The brand name for most devices is recorded as "Others" -- nearly 14.5% of the devices do not have a specific brand name.
Samsung and Huawei have the highest shares among the named brands.
labeled_barplot(data, "4g", perc=True, n=10)## code to create labeled_barplot for '4g'
labeled_barplot(data, "5g", perc=True, n=10)## code to create labeled_barplot for '5g'
Observations:
1. Only 4.4% of the devices have 5G availability.
2. Nearly 68% of the devices have 4G.
labeled_barplot(data, "release_year", perc=True, n=10)## code to create labeled_barplot for 'release year'
Observation: Most of the used devices were released in 2013 and 2014.
## let's find the relationships between the numerical variables
sns.pairplot(data)
cols_list = data.select_dtypes(include=np.number).columns.tolist()
# dropping release_year as it is a temporal variable
cols_list.remove("release_year")
plt.figure(figsize=(15, 7))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
Observations:
1. used_price is most strongly correlated with new_price (0.81).
2. screen_size is strongly correlated with weight (0.83) and battery (0.81), so multicollinearity may be a concern later.
# bivariate analysis of used price with screen size (pyplot is already imported as plt)
plt.scatter(data['screen_size'], data['used_price'])
plt.show()
# bivariate analysis of used price with new price
plt.scatter(data['new_price'], data['used_price'])
plt.show()
# bivariate analysis of screen size and battery
plt.scatter(data['screen_size'], data['battery'])
plt.show()
# bivariate analysis of weight and screen size
plt.scatter(data['screen_size'], data['weight'])
plt.show()
# bivariate analysis of battery and weight
plt.scatter(data['battery'], data['weight'])
plt.show()
Observations: There appears to be a linear relationship between used price and new price, and also between battery and weight.
plt.figure(figsize=(15, 5))# analysis of ram for different brand names
sns.barplot(data=data, x="brand_name", y="ram")
plt.xticks(rotation=90)
plt.show()
Observations:
1. OnePlus devices offer the most RAM on average, while Celkon devices offer the least.
2. Nokia and Infinix also offer relatively little RAM, though more than Celkon.
3. Most other brands offer roughly the same amount of RAM, which is decent compared with Celkon, Nokia, and Infinix.
4. OnePlus is the brand to consider when looking for more RAM.
# let's get the subset of the data where the battery capacity is more than 4500 mAh
df_large_battery = data[data.battery > 4500]
df_large_battery.shape
(341, 15)
Observations: There are 341 devices whose battery capacity is more than 4500 mAh.
#lets check the weight for different battery capacities.
df_large_battery.groupby("battery")["weight"].mean().sort_values(ascending=True)
battery
5870.000 150.000
6180.000 167.500
5180.000 195.000
5020.000 205.100
4520.000 206.500
...
9720.000 636.000
9300.000 637.000
8400.000 655.000
9500.000 743.033
8380.000 855.000
Name: weight, Length: 72, dtype: float64
Observations: The output clearly shows that, in general, as battery capacity increases, the weight of the device also increases.
## let's check the weight of the device for different brands with battery capacity > 4500 mAh
df_large_battery.groupby("brand_name")["weight"].mean().sort_values(ascending=False)
brand_name
Google      517.000
Lenovo      442.721
Apple       439.559
Sony        439.500
HTC         425.000
Samsung     398.352
Huawei      394.486
Others      390.546
Alcatel     380.000
LG          366.058
Acer        360.000
Nokia       318.000
Asus        313.773
Honor       248.714
Xiaomi      231.500
Gionee      209.430
Motorola    200.757
Realme      196.833
Vivo        195.631
ZTE         195.400
Oppo        195.000
Infinix     193.000
Panasonic   182.000
Spice       158.000
Micromax    118.000
Name: weight, dtype: float64
Observations: Google devices weigh the most, followed by Lenovo, Apple, and Sony; Micromax devices weigh the least among the given brands.
plt.figure(figsize=(15, 5))
sns.barplot( x="brand_name",y="weight" ,data=df_large_battery) ## barplot for 'brand_name' and 'weight' for devices having large batteries.
plt.xticks(rotation=60)
plt.show()
df_large_battery.describe(include='all')
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 341 | 341 | 341.000 | 341 | 341 | 287.000 | 341.000 | 341.000 | 341.000 | 341.000 | 341.000 | 341.000 | 341.000 | 341.000 | 341.000 |
| unique | 25 | 3 | NaN | 2 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | Samsung | Android | NaN | yes | no | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 50 | 324 | NaN | 275 | 317 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | NaN | 19.767 | NaN | NaN | 9.186 | 7.237 | 59.261 | 4.252 | 5858.104 | 332.276 | 2017.117 | 568.437 | 303.644 | 128.041 |
| std | NaN | NaN | 4.803 | NaN | NaN | 4.915 | 6.290 | 87.071 | 1.476 | 1279.176 | 155.502 | 2.462 | 277.996 | 198.876 | 51.428 |
| min | NaN | NaN | 6.680 | NaN | NaN | 0.300 | 0.300 | 16.000 | 1.000 | 4520.000 | 118.000 | 2013.000 | 92.000 | 80.820 | 30.600 |
| 25% | NaN | NaN | 15.420 | NaN | NaN | 5.000 | 2.000 | 16.000 | 4.000 | 5000.000 | 198.000 | 2015.000 | 319.000 | 179.980 | 94.490 |
| 50% | NaN | NaN | 20.320 | NaN | NaN | 8.000 | 5.000 | 32.000 | 4.000 | 5100.000 | 300.000 | 2017.000 | 582.000 | 250.000 | 116.200 |
| 75% | NaN | NaN | 25.430 | NaN | NaN | 13.000 | 8.000 | 64.000 | 4.000 | 6660.000 | 467.000 | 2019.000 | 780.000 | 349.840 | 147.230 |
| max | NaN | NaN | 30.710 | NaN | NaN | 48.000 | 32.000 | 1024.000 | 12.000 | 9720.000 | 855.000 | 2020.000 | 1089.000 | 1200.850 | 472.940 |
observations:
1. Google, HTC, Micromax, Nokia, Spice, and Oppo each have only one device in this subset, hence the black error bars are missing from their bars.
2. The Google device weighs the most; it may be a tablet.
3. The Micromax device weighs the least; it may be a mobile phone.
df_large_screen = data[data.screen_size > 6*2.54]#create a data set having screen size >6 inches
df_large_screen.shape
(1099, 15)
Observations: There are 1099 devices with a screen size larger than 6 inches.
labeled_barplot(df_large_screen, "brand_name", perc=True, n=32)# percentage of devices from all brands with screen size>6in
df_large_screen["brand_name"].value_counts()
Huawei       149
Samsung      119
Others        99
Vivo          80
Honor         72
Oppo          70
Lenovo        69
Xiaomi        69
LG            59
Motorola      42
Asus          41
Realme        40
Alcatel       26
Apple         24
Acer          19
ZTE           17
Meizu         17
OnePlus       16
Nokia         15
Sony          12
Infinix       10
HTC            7
Micromax       7
Google         4
XOLO           3
Coolpad        3
Gionee         3
Panasonic      2
Karbonn        2
Spice          2
Microsoft      1
Name: brand_name, dtype: int64
Observations:
1. Huawei has the most devices with a screen size larger than 6 inches, followed by Samsung.
2. Microsoft has only one device with a screen size larger than 6 inches.
3. Huawei offers bigger screens, which are desirable for customers who want a better viewing experience for entertainment.
data['new_price'].describe()#statistical analysis of new price
count   3454.000
mean     237.039
std      194.303
min       18.200
25%      120.343
50%      189.785
75%      291.115
max     2560.200
Name: new_price, dtype: float64
Observation:
The mean new price of the devices is around 237 euros. To study the distribution across price segments, we first cut the data set into budget, mid-ranger, and premium categories.
## dividing the devices into budget,mid ranger and premium.
data["device_category"] = pd.cut(
x=data.new_price,
bins=[-np.infty, 200, 350, np.infty],
labels=["Budget", "Mid-ranger", "Premium"],
)
data["device_category"].value_counts()
Budget        1844
Mid-ranger    1025
Premium        585
Name: device_category, dtype: int64
labeled_barplot(data, "device_category", perc=True, n=10) ## create labeled_barplot for 'device_category'
Observations: Budget-friendly devices make up a larger share of the data than mid-ranger and premium devices.
df_selfie_camera = data[data.selfie_camera_mp > 8.0] ## create a dataframe having selfie_camera_mp > 8
df_selfie_camera.shape
(655, 16)
plt.figure(figsize=(15, 5))
sns.countplot(data=df_selfie_camera,x="brand_name",hue="device_category") ## create countplot for 'brandname' with hue = 'device_category' on df_selfie_camera
plt.xticks(rotation=60)
plt.legend(loc=1)
plt.show()
data_selfie=data[data.device_category == "Budget"]#create a dataset only for budget.
df_selfie_camera1 = data_selfie[data_selfie.selfie_camera_mp > 8.0] ## create a dataframe having selfie_camera_mp > 8
df_selfie_camera1.shape
(126, 16)
plt.figure(figsize=(15, 5))
sns.countplot(data=df_selfie_camera1,x="brand_name") ## create countplot for 'brand_name' on the budget subset
plt.xticks(rotation=60)
plt.show()
sns.displot(data=df_selfie_camera1, x="selfie_camera_mp", kde=True, height=5, aspect=3)  ## displot is figure-level and creates its own figure, so plt.figure() is not needed
plt.xticks(rotation=60)
plt.show()
df_selfie_camera1["brand_name"].value_counts()
Xiaomi       20
Huawei       13
Realme       13
Motorola     10
Samsung       9
LG            9
Vivo          8
Honor         7
Lenovo        6
Oppo          4
Sony          4
Meizu         4
Infinix       4
Others        3
ZTE           3
HTC           3
Asus          2
Nokia         2
Panasonic     1
Coolpad       1
Name: brand_name, dtype: int64
observations:
1. We divided the devices into budget, mid-ranger, and premium categories based on the new price.
2. Budget-friendly devices are the most numerous.
3. We created a subset of devices with a selfie camera above 8 MP.
4. Both count plots make it clear that Xiaomi is the budget-friendly brand that also offers a selfie camera above 8 MP, followed by Huawei and Realme.
5. The distribution plot clearly shows that the distribution is not normal.
data.corr()# lets check the correlation between the variables
| | screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| screen_size | 1.000 | 0.150 | 0.272 | 0.071 | 0.274 | 0.814 | 0.829 | 0.364 | -0.292 | 0.341 | 0.529 |
| main_camera_mp | 0.150 | 1.000 | 0.429 | 0.019 | 0.261 | 0.249 | -0.088 | 0.354 | -0.145 | 0.358 | 0.459 |
| selfie_camera_mp | 0.272 | 0.429 | 1.000 | 0.296 | 0.477 | 0.370 | -0.005 | 0.691 | -0.553 | 0.416 | 0.615 |
| int_memory | 0.071 | 0.019 | 0.296 | 1.000 | 0.122 | 0.118 | 0.015 | 0.235 | -0.243 | 0.369 | 0.378 |
| ram | 0.274 | 0.261 | 0.477 | 0.122 | 1.000 | 0.281 | 0.090 | 0.314 | -0.280 | 0.494 | 0.529 |
| battery | 0.814 | 0.249 | 0.370 | 0.118 | 0.281 | 1.000 | 0.703 | 0.489 | -0.371 | 0.370 | 0.550 |
| weight | 0.829 | -0.088 | -0.005 | 0.015 | 0.090 | 0.703 | 1.000 | 0.071 | -0.067 | 0.219 | 0.358 |
| release_year | 0.364 | 0.354 | 0.691 | 0.235 | 0.314 | 0.489 | 0.071 | 1.000 | -0.750 | 0.304 | 0.495 |
| days_used | -0.292 | -0.145 | -0.553 | -0.243 | -0.280 | -0.371 | -0.067 | -0.750 | 1.000 | -0.246 | -0.386 |
| new_price | 0.341 | 0.358 | 0.416 | 0.369 | 0.494 | 0.370 | 0.219 | 0.304 | -0.246 | 1.000 | 0.809 |
| used_price | 0.529 | 0.459 | 0.615 | 0.378 | 0.529 | 0.550 | 0.358 | 0.495 | -0.386 | 0.809 | 1.000 |
cols_list = data.select_dtypes(include=np.number).columns.tolist()##creating a list with only numeric data types.
plt.figure(figsize=(25, 5))
plt.subplot(122)
sns.heatmap( data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral" )
plt.show()
Observations:
1. used_price is highly correlated with new_price and selfie_camera_mp.
2. The price of a used device tends to be higher if the price of the new device was higher.
3. A good selfie camera is also an important factor customers look for when buying used devices.
4. RAM, battery, and screen size may also play a role after selfie_camera_mp and new_price.
plt.figure(figsize=(15, 5))  ## let's check the relationship between brand name and main camera mp
plt.scatter(data['brand_name'], data['main_camera_mp'])
plt.xticks(rotation=60)
plt.show()
Observation: Since brand_name is categorical, there is no linear relationship between brand name and main_camera_mp.
plt.figure(figsize=(15, 5))  ## let's check the relationship between main camera mp and used price
plt.scatter(data['main_camera_mp'], data['used_price'])
plt.xticks(rotation=60)
plt.show()
Observation: The scatter plot shows only a weak positive relationship between main_camera_mp and used_price.
plt.figure(figsize=(10, 5))
sns.barplot(data=data,x="release_year",y="used_price") ## barplot for release year and used price
plt.show()
Observation: The plot clearly shows that the price of used devices increases as the release year gets closer to the present; 2013 has the lowest prices and 2020 the highest.
plt.figure(figsize=(10, 5))
sns.barplot(data=data,x="4g",y="used_price") ## barplot for 4g and used price
plt.show()
Observation: Devices with 4G command higher used prices than those without.
plt.figure(figsize=(10, 5))
sns.barplot(data=data,x="5g",y="used_price") ## barplot for 5g and used price
plt.show()
Observation: Used devices with 5G command higher prices than those without.
plt.figure(figsize=(15, 5))  # bivariate analysis of used price and days used
plt.scatter(data['days_used'], data['used_price'])
plt.xticks(rotation=60)
plt.show()
Observation: There is only a weak negative relationship between days_used and the price of the used device.
##lets check how many devices use both 4g and 5g
data1=data[(data['4g']=='yes')& (data['5g']=='yes')]
data1.head()
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | device_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Honor | Android | 17.300 | yes | yes | 13.000 | 16.000 | 128.000 | 8.000 | 4300.000 | 213.000 | 2020 | 325 | 249.390 | 174.530 | Mid-ranger |
| 2 | Honor | Android | 16.690 | yes | yes | 13.000 | 8.000 | 128.000 | 8.000 | 4200.000 | 213.000 | 2020 | 162 | 359.470 | 165.850 | Premium |
| 3 | Honor | Android | 25.500 | yes | yes | 13.000 | 8.000 | 64.000 | 6.000 | 7250.000 | 480.000 | 2020 | 345 | 278.930 | 169.930 | Mid-ranger |
| 12 | Honor | Android | 16.690 | yes | yes | 13.000 | 16.000 | 128.000 | 8.000 | 4100.000 | 206.000 | 2019 | 537 | 499.680 | 217.730 | Premium |
| 27 | Huawei | Android | 15.370 | yes | yes | 10.500 | 16.000 | 128.000 | 6.000 | 4000.000 | 192.000 | 2020 | 91 | 249.730 | 135.030 | Mid-ranger |
plt.figure(figsize=(15, 5))  ## let's check how the price varies for devices with both 4g and 5g
plt.scatter(data1['new_price'], data1['used_price'])
plt.xticks(rotation=60)
plt.show()
Observations:
1. Only Android devices have both 4G and 5G.
2. All these devices were released in recent years (2019 and 2020).
plt.figure(figsize=(10, 5))
sns.barplot(data=data,x="os",y="used_price") ## let's check the used price for different types of os.
plt.show()
Observations: Used devices running iOS command the highest prices.
plt.figure(figsize=(10, 5))
sns.barplot(data=data,x="ram",y="used_price") ## lets check the used price for ram.
plt.show()
Observation: Used devices with more RAM command higher prices.
df_pre = data.copy()  # let's make a copy of the data set
df_pre.isnull().sum()  ## let's check for the missing values again
brand_name            0
os                    0
screen_size           0
4g                    0
5g                    0
main_camera_mp      179
selfie_camera_mp      2
int_memory            4
ram                   4
battery               6
weight                7
release_year          0
days_used             0
new_price             0
used_price            0
device_category       0
dtype: int64
1. There are missing values.
2. Let's check the distributions of the variables with missing data.
3. If a distribution is skewed, we will replace the missing values with the median.
4. If a distribution is normal, we will replace the missing values with the mean.
histogram_boxplot(df_pre, "main_camera_mp")#lets check for mean and median
histogram_boxplot(df_pre, "selfie_camera_mp")#lets check for mean and median
histogram_boxplot(df_pre, "int_memory")#lets check for mean and median
histogram_boxplot(df_pre, "ram")#lets check for mean and median
histogram_boxplot(df_pre, "battery")#lets check for mean and median
histogram_boxplot(df_pre, "weight")#lets check for mean and median
Observations:
1. All the attributes with missing values are skewed, so we replace each column's missing values with its respective median.
# let's list the variables with missing values and use a lambda function to replace the missing values with their respective medians
cols_impute = [
"main_camera_mp",
"selfie_camera_mp",
"int_memory",
"ram",
"battery",
"weight",
]
for col in cols_impute:
df_pre[col] = df_pre[col].transform(
lambda x: x.fillna(x.median()))
Let's check for missing values again.
df_pre.isnull().sum()
brand_name          0
os                  0
screen_size         0
4g                  0
5g                  0
main_camera_mp      0
selfie_camera_mp    0
int_memory          0
ram                 0
battery             0
weight              0
release_year        0
days_used           0
new_price           0
used_price          0
device_category     0
dtype: int64
Observation: There are no missing values now.
Let's apply a log transformation to used_price and new_price, as both are positively skewed and correlated with each other.
#log transformation of used price
df_pre["used_price_log"] = np.log(df_pre["used_price"])
histogram_boxplot(df_pre, "used_price_log")
Observations: The distribution almost looks normal with some outliers.
#log transformation of new price
df_pre["new_price_log"] = np.log(df_pre["new_price"])
histogram_boxplot(df_pre, "new_price_log")
Observation:The distribution of new price log looks almost normal with some outliers.
We have to deal with the categorical variables before building the linear regression model. We will create dummy variables for all the categorical independent features.
df_model=df_pre.copy()#lets copy the data set
# creating dummy variables
df_model = pd.get_dummies(
df_model,
columns=df_model.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
)
df_model.head()
| | screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | ... | brand_name_XOLO | brand_name_Xiaomi | brand_name_ZTE | os_Others | os_Windows | os_iOS | 4g_yes | 5g_yes | device_category_Mid-ranger | device_category_Premium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.500 | 13.000 | 5.000 | 64.000 | 3.000 | 3020.000 | 146.000 | 2020 | 127 | 111.620 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 17.300 | 13.000 | 16.000 | 128.000 | 8.000 | 4300.000 | 213.000 | 2020 | 325 | 249.390 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 2 | 16.690 | 13.000 | 8.000 | 128.000 | 8.000 | 4200.000 | 213.000 | 2020 | 162 | 359.470 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 3 | 25.500 | 13.000 | 8.000 | 64.000 | 6.000 | 7250.000 | 480.000 | 2020 | 345 | 278.930 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 4 | 15.320 | 13.000 | 8.000 | 64.000 | 3.000 | 5000.000 | 185.000 | 2020 | 293 | 140.870 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
5 rows × 53 columns
#lets check for all the columns
df_model.columns
Index(['screen_size', 'main_camera_mp', 'selfie_camera_mp', 'int_memory',
'ram', 'battery', 'weight', 'release_year', 'days_used', 'new_price',
'used_price', 'used_price_log', 'new_price_log', 'brand_name_Alcatel',
'brand_name_Apple', 'brand_name_Asus', 'brand_name_BlackBerry',
'brand_name_Celkon', 'brand_name_Coolpad', 'brand_name_Gionee',
'brand_name_Google', 'brand_name_HTC', 'brand_name_Honor',
'brand_name_Huawei', 'brand_name_Infinix', 'brand_name_Karbonn',
'brand_name_LG', 'brand_name_Lava', 'brand_name_Lenovo',
'brand_name_Meizu', 'brand_name_Micromax', 'brand_name_Microsoft',
'brand_name_Motorola', 'brand_name_Nokia', 'brand_name_OnePlus',
'brand_name_Oppo', 'brand_name_Others', 'brand_name_Panasonic',
'brand_name_Realme', 'brand_name_Samsung', 'brand_name_Sony',
'brand_name_Spice', 'brand_name_Vivo', 'brand_name_XOLO',
'brand_name_Xiaomi', 'brand_name_ZTE', 'os_Others', 'os_Windows',
'os_iOS', '4g_yes', '5g_yes', 'device_category_Mid-ranger',
'device_category_Premium'],
dtype='object')
Observation: There are 53 columns now, after creating dummy variables.
Let's check for outliers in the dataset using boxplots and z-scores.
# list containing only the numerical variables
num_var = ["screen_size", "main_camera_mp", "selfie_camera_mp", "int_memory", "ram", "battery", "weight", "days_used", "new_price_log", "used_price_log"]
# box plots for all numerical variables
plt.figure(figsize=(10, 15))
sns.boxplot(data=df_pre[num_var], orient='h', palette='Set2')
plt.show()
# calculate z-scores for screen_size and print the outlier indices
from scipy import stats

z = np.abs(stats.zscore(df_pre["screen_size"]))
threshold = 3
print(np.where(z > threshold))
(array([ 3, 19, 33, 36, 40, 49, 66, 67, 75, 83, 222,
238, 252, 255, 259, 268, 285, 286, 294, 302, 366, 368,
371, 373, 382, 388, 389, 390, 391, 403, 404, 423, 433,
460, 476, 633, 642, 644, 645, 653, 654, 662, 679, 749,
773, 785, 786, 801, 806, 811, 925, 992, 1086, 1141, 1142,
1151, 1276, 1277, 1295, 1296, 1308, 1330, 1341, 1378, 1409, 1506,
1507, 1520, 1530, 1544, 1567, 1584, 1598, 1602, 1611, 1664, 1692,
1735, 1736, 1962, 2089, 2270, 2275, 2276, 2371, 2372, 2373, 2375,
2376, 2450, 2546, 2547, 2563, 2564, 2565, 2581, 2582, 2583, 2584,
2585, 2586, 2587, 2588, 2608, 2619, 2620, 2621, 2713, 2714, 2733,
2734, 2749, 2750, 2781, 2786, 2787, 2788, 2884, 2888, 2937],
dtype=int64),)
# calculate z-scores for selfie_camera_mp and print the outlier indices
z1 = np.abs(stats.zscore(df_pre["selfie_camera_mp"]))
print(np.where(z1 > threshold))
(array([ 16, 41, 47, 52, 53, 73, 84, 122, 123, 124, 127,
128, 131, 155, 157, 162, 163, 166, 176, 194, 201, 202,
235, 260, 266, 271, 272, 292, 303, 341, 342, 343, 346,
347, 350, 364, 365, 370, 1088, 1089, 1255, 1256, 1257, 1258,
2364, 2365, 3195, 3199, 3200, 3201, 3203, 3205, 3208, 3214, 3219,
3221, 3250, 3251, 3252, 3253, 3254, 3280, 3286, 3287, 3288, 3291,
3311, 3312, 3313, 3317, 3320, 3336, 3340, 3341, 3342, 3344, 3346,
3349, 3355, 3360, 3362, 3391, 3392, 3393, 3394, 3395, 3415, 3419,
3428, 3429, 3431, 3432, 3437, 3439], dtype=int64),)
# calculate z-scores for main_camera_mp and print the outlier indices
z2 = np.abs(stats.zscore(df_pre["main_camera_mp"]))
print(np.where(z2 > threshold))
(array([ 210, 215, 595, 850, 1052, 2097, 2283, 3069, 3240, 3381],
dtype=int64),)
# calculate z-scores for ram and print the outlier indices
z3 = np.abs(stats.zscore(df_pre["ram"]))
print(np.where(z3 > threshold))
(array([ 44, 109, 120, 198, 263, 328, 339, 372, 3250, 3252, 3391,
3393, 3420, 3421, 3422, 3424, 3436, 3440], dtype=int64),)
# calculate z-scores for weight and print the outlier indices
z4 = np.abs(stats.zscore(df_pre["weight"]))
print(np.where(z4 > threshold))
(array([ 3, 21, 33, 36, 40, 49, 66, 67, 75, 83, 222,
240, 252, 255, 259, 268, 285, 286, 294, 302, 366, 371,
382, 388, 389, 390, 391, 403, 404, 419, 433, 460, 612,
614, 633, 642, 644, 645, 649, 653, 654, 655, 662, 668,
679, 723, 731, 749, 773, 785, 786, 801, 806, 811, 925,
992, 1086, 1141, 1142, 1151, 1276, 1277, 1295, 1296, 1307, 1330,
1341, 1378, 1409, 1506, 1507, 1520, 1530, 1544, 1545, 1567, 1584,
1598, 1602, 1611, 1664, 1692, 1735, 1736, 1962, 2089, 2270, 2271,
2273, 2274, 2275, 2276, 2279, 2372, 2373, 2450, 2486, 2490, 2491,
2546, 2547, 2563, 2564, 2565, 2581, 2582, 2583, 2584, 2585, 2586,
2587, 2588, 2619, 2620, 2621, 2749, 2750, 2777, 2786, 2787, 2788,
2884, 2937, 3040, 3051], dtype=int64),)
# calculate z-scores for new_price_log and print the outlier indices
z5 = np.abs(stats.zscore(df_pre["new_price_log"]))
print(np.where(z5 > threshold))
(array([ 44, 114, 198, 263, 333, 534, 618, 630, 884, 885, 886,
898, 952, 953, 954, 955, 965, 1262, 1272, 1898, 1903, 1904,
1924, 1925, 1926, 1927, 2288, 2320, 2324, 2358, 2533, 3207, 3348,
3436], dtype=int64),)
# calculate z-scores for the target variable and print the outlier indices
z6 = np.abs(stats.zscore(df_pre["used_price_log"]))
print(np.where(z6 > threshold))
(array([ 34, 104, 198, 323, 332, 333, 520, 533, 534, 645, 884,
885, 886, 897, 898, 951, 952, 953, 954, 955, 965, 966,
968, 1898, 1903, 1925, 1929, 2035, 2064, 2135, 2320, 2324, 2327,
2533, 3207], dtype=int64),)
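The per-column cells above all repeat the same pattern, so they can be condensed into one helper. A sketch — `zscore_outlier_index` is a hypothetical name, and `df_pre` with the threshold of 3 are assumed from the cells above:

```python
# A loop version of the per-column z-score checks above.
# Note: `zscore_outlier_index` is a hypothetical helper name.
import numpy as np
import pandas as pd
from scipy import stats

def zscore_outlier_index(df, cols, threshold=3):
    """Map each column to the row positions where |z-score| > threshold."""
    return {c: np.where(np.abs(stats.zscore(df[c])) > threshold)[0] for c in cols}

# hypothetical usage on the notebook's dataframe:
# zscore_outlier_index(df_pre, ["screen_size", "selfie_camera_mp", "ram", "weight"])
```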
Using IQR method
# 25th and 75th percentiles of battery
percentile25 = df_model['battery'].quantile(0.25)
percentile75 = df_model['battery'].quantile(0.75)
iqr = percentile75 - percentile25
iqr
1900.0
# upper and lower limits using the 1.5 * IQR rule
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
# new dataset excluding the battery outliers
new_df = df_model[(df_model["battery"] > lower_limit) & (df_model["battery"] < upper_limit)]
new_df.shape
(3377, 53)
histogram_boxplot(new_df, "battery")  # re-check battery for outliers
# 25th and 75th percentiles of weight
percentile251 = new_df['weight'].quantile(0.25)
percentile751 = new_df['weight'].quantile(0.75)
iqr1 = percentile751 - percentile251
iqr1
41.5
# upper and lower limits using the 1.5 * IQR rule
upper_limit1 = percentile751 + 1.5 * iqr1
lower_limit1 = percentile251 - 1.5 * iqr1
# new dataset excluding the weight outliers
new_df1 = new_df[(new_df["weight"] > lower_limit1) & (new_df["weight"] < upper_limit1)]
new_df1.shape
(3075, 53)
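The battery and weight cells follow the same filter pattern, which can be wrapped in a small helper. A sketch — `iqr_filter` is a hypothetical name, and the strict inequalities mirror the cells above:

```python
# Reusable IQR filter; `iqr_filter` is a hypothetical helper name, and the
# strict inequalities mirror the battery/weight cells above.
import pandas as pd

def iqr_filter(df, col, k=1.5):
    """Keep rows where df[col] lies strictly inside (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    return df[(df[col] > q1 - k * iqr) & (df[col] < q3 + k * iqr)]

# hypothetical usage, equivalent to the two steps above:
# new_df1 = iqr_filter(iqr_filter(df_model, "battery"), "weight")
```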
Outlier detection using boxplots, z-scores, and the IQR method is done; we have removed the outliers for battery and weight.
Let's explore the data again after missing value imputation, feature engineering, and outlier treatment.
cols_list = new_df1.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(70, 15))
plt.subplot(122)
sns.heatmap(new_df1[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
# box plots for all numerical variables
plt.figure(figsize=(10, 15))
sns.boxplot(data=new_df1, orient='h', palette='Set2')
plt.show()
Observation: The plots show the exploratory data analysis of the dataset after missing value treatment and feature engineering. The dependent variable is correlated with one or more independent variables, and some predictors are correlated with each other, which we will address during model building.
We want to predict the used device price, so we will use the log-transformed version, used_price_log, for modeling.
We'll split the data into train and test to be able to evaluate the model.
# defining the dependent and independent variables
# drop new_price, used_price, used_price_log, and the device_category dummies
X = new_df1.drop(columns=['new_price', 'used_price', 'used_price_log', 'device_category_Mid-ranger', 'device_category_Premium'])
y = new_df1["used_price_log"]
print(X.head())
print()
print(y.head())
   screen_size  main_camera_mp  selfie_camera_mp  int_memory   ram  battery  \
0       14.500          13.000             5.000      64.000 3.000 3020.000
1       17.300          13.000            16.000     128.000 8.000 4300.000
2       16.690          13.000             8.000     128.000 8.000 4200.000
4       15.320          13.000             8.000      64.000 3.000 5000.000
5       16.230          13.000             8.000      64.000 4.000 4000.000

    weight  release_year  days_used  new_price_log  ...  brand_name_Spice  \
0  146.000          2020        127          4.715  ...                 0
1  213.000          2020        325          5.519  ...                 0
2  213.000          2020        162          5.885  ...                 0
4  185.000          2020        293          4.948  ...                 0
5  176.000          2020        223          5.061  ...                 0

   brand_name_Vivo  brand_name_XOLO  brand_name_Xiaomi  brand_name_ZTE  \
0                0                0                  0               0
1                0                0                  0               0
2                0                0                  0               0
4                0                0                  0               0
5                0                0                  0               0

   os_Others  os_Windows  os_iOS  4g_yes  5g_yes
0          0           0       0       1       0
1          0           0       0       1       1
2          0           0       0       1       1
4          0           0       0       1       0
5          0           0       0       1       0

[5 rows x 48 columns]

0   4.308
1   5.162
2   5.111
4   4.390
5   4.414
Name: used_price_log, dtype: float64
# splitting the data in 70:30 ratio for train to test data
x_train, x_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=1
) ## code to split the data into train and test in specified ratio
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 2152
Number of rows in test data = 923
Let's build a linear regression model using statsmodels.
# adding constant to the train data
x_train = sm.add_constant(x_train)
# adding constant to the test data
x_test = sm.add_constant(x_test)  # add a constant to the test data
olsmodel1 = sm.OLS(y_train, x_train)  # define the OLS model
olsres = olsmodel1.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_log R-squared: 0.853
Model: OLS Adj. R-squared: 0.850
Method: Least Squares F-statistic: 259.6
Date: Thu, 09 Dec 2021 Prob (F-statistic): 0.00
Time: 19:57:09 Log-Likelihood: 119.76
No. Observations: 2152 AIC: -143.5
Df Residuals: 2104 BIC: 128.8
Df Model: 47
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -51.4385 10.454 -4.921 0.000 -71.939 -30.938
screen_size 0.0307 0.004 7.269 0.000 0.022 0.039
main_camera_mp 0.0224 0.002 13.965 0.000 0.019 0.026
selfie_camera_mp 0.0128 0.001 10.874 0.000 0.010 0.015
int_memory -4.213e-05 6.95e-05 -0.606 0.545 -0.000 9.42e-05
ram 0.0250 0.006 4.538 0.000 0.014 0.036
battery -9.427e-06 1.08e-05 -0.871 0.384 -3.07e-05 1.18e-05
weight 0.0011 0.000 3.587 0.000 0.001 0.002
release_year 0.0261 0.005 5.028 0.000 0.016 0.036
days_used 5.198e-05 3.31e-05 1.570 0.117 -1.3e-05 0.000
new_price_log 0.4225 0.013 32.143 0.000 0.397 0.448
brand_name_Alcatel -0.0155 0.051 -0.302 0.762 -0.116 0.085
brand_name_Apple 0.1105 0.041 2.674 0.008 0.029 0.192
brand_name_Asus 0.0348 0.053 0.661 0.509 -0.068 0.138
brand_name_BlackBerry -0.0453 0.081 -0.558 0.577 -0.204 0.114
brand_name_Celkon -0.0908 0.072 -1.267 0.205 -0.231 0.050
brand_name_Coolpad 0.0827 0.070 1.179 0.238 -0.055 0.220
brand_name_Gionee 0.0406 0.057 0.713 0.476 -0.071 0.152
brand_name_Google -0.0452 0.086 -0.523 0.601 -0.215 0.124
brand_name_HTC -0.0044 0.051 -0.087 0.931 -0.104 0.095
brand_name_Honor 0.0197 0.052 0.379 0.704 -0.082 0.122
brand_name_Huawei -0.0056 0.048 -0.116 0.908 -0.099 0.088
brand_name_Infinix 0.1526 0.100 1.529 0.126 -0.043 0.348
brand_name_Karbonn 0.0008 0.074 0.011 0.991 -0.145 0.146
brand_name_LG -0.0175 0.048 -0.364 0.716 -0.112 0.077
brand_name_Lava 0.0355 0.065 0.549 0.583 -0.091 0.162
brand_name_Lenovo 0.0298 0.049 0.606 0.545 -0.067 0.126
brand_name_Meizu 0.0036 0.058 0.062 0.950 -0.110 0.117
brand_name_Micromax -0.0325 0.051 -0.633 0.527 -0.133 0.068
brand_name_Microsoft 0.1072 0.088 1.220 0.222 -0.065 0.279
brand_name_Motorola -0.0044 0.053 -0.084 0.933 -0.108 0.099
brand_name_Nokia 0.0623 0.055 1.143 0.253 -0.045 0.169
brand_name_OnePlus 0.1240 0.075 1.648 0.100 -0.024 0.271
brand_name_Oppo 0.0278 0.051 0.549 0.583 -0.071 0.127
brand_name_Others -0.0074 0.045 -0.163 0.871 -0.097 0.082
brand_name_Panasonic 0.0047 0.059 0.079 0.937 -0.111 0.120
brand_name_Realme 0.0810 0.062 1.302 0.193 -0.041 0.203
brand_name_Samsung 0.0182 0.047 0.388 0.698 -0.074 0.110
brand_name_Sony -0.0301 0.054 -0.559 0.576 -0.136 0.076
brand_name_Spice 0.0016 0.065 0.025 0.980 -0.126 0.129
brand_name_Vivo -0.0255 0.051 -0.503 0.615 -0.125 0.074
brand_name_XOLO -0.0032 0.058 -0.056 0.955 -0.116 0.110
brand_name_Xiaomi 0.0716 0.051 1.417 0.157 -0.027 0.171
brand_name_ZTE -0.0130 0.050 -0.259 0.796 -0.111 0.085
os_Others 0.0113 0.035 0.324 0.746 -0.057 0.080
os_Windows -0.0220 0.044 -0.499 0.618 -0.108 0.064
os_iOS 0.1105 0.041 2.674 0.008 0.029 0.192
4g_yes 0.0513 0.018 2.927 0.003 0.017 0.086
5g_yes -0.0158 0.032 -0.495 0.621 -0.078 0.047
==============================================================================
Omnibus: 211.745 Durbin-Watson: 2.034
Prob(Omnibus): 0.000 Jarque-Bera (JB): 450.212
Skew: -0.613 Prob(JB): 1.73e-98
Kurtosis: 4.876 Cond. No. 1.02e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.8e-22. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Observations:
1. A negative coefficient suggests that, holding the other features fixed, the used price (log) tends to decrease as that feature increases.
2. The used price increases significantly with screen size, main camera MP, selfie camera MP, RAM, weight, release year, and (log) new price; int_memory, battery, and days_used are not statistically significant at the 5% level.
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# function to compute MAPE
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mape_score(target, pred) # to compute MAPE
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmodel_train_perf = model_performance_regression(olsres, x_train, y_train)
olsmodel_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.229 | 0.179 | 0.853 | 0.849 | 4.324 |
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmodel_test_perf = model_performance_regression(olsres, x_test, y_test)
olsmodel_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.238 | 0.182 | 0.823 | 0.813 | 4.334 |
Observations:
1. The R-squared dropped from 0.853 on the train set to 0.823 on the test set, a modest decline that suggests slight overfitting.
2. We are building a linear regression model for used devices, which can be phones or tablets.
3. The devices were released between 2013 and 2020, can be smart or basic, and belong to different brands with different specifications, so the data varies considerably from device to device.
4. There are no extreme or incorrectly measured values after outlier treatment.
Let's check the model performance without dropping the outliers.
# defining the dependent and independent variables
# drop new_price, used_price, used_price_log, and the device_category dummies
X1 = df_model.drop(columns=['new_price', 'used_price', 'used_price_log', 'device_category_Mid-ranger', 'device_category_Premium'])
y1 = df_model["used_price_log"]
print(X1.head())
print()
print(y1.head())
   screen_size  main_camera_mp  selfie_camera_mp  int_memory   ram  battery  \
0       14.500          13.000             5.000      64.000 3.000 3020.000
1       17.300          13.000            16.000     128.000 8.000 4300.000
2       16.690          13.000             8.000     128.000 8.000 4200.000
3       25.500          13.000             8.000      64.000 6.000 7250.000
4       15.320          13.000             8.000      64.000 3.000 5000.000

    weight  release_year  days_used  new_price_log  ...  brand_name_Spice  \
0  146.000          2020        127          4.715  ...                 0
1  213.000          2020        325          5.519  ...                 0
2  213.000          2020        162          5.885  ...                 0
3  480.000          2020        345          5.631  ...                 0
4  185.000          2020        293          4.948  ...                 0

   brand_name_Vivo  brand_name_XOLO  brand_name_Xiaomi  brand_name_ZTE  \
0                0                0                  0               0
1                0                0                  0               0
2                0                0                  0               0
3                0                0                  0               0
4                0                0                  0               0

   os_Others  os_Windows  os_iOS  4g_yes  5g_yes
0          0           0       0       1       0
1          0           0       0       1       1
2          0           0       0       1       1
3          0           0       0       1       1
4          0           0       0       1       0

[5 rows x 48 columns]

0   4.308
1   5.162
2   5.111
3   5.135
4   4.390
Name: used_price_log, dtype: float64
# splitting the data in 70:30 ratio for train to test data
x_train, x_test, y_train, y_test = train_test_split(
X1, y1, test_size=0.30, random_state=1
) ## code to split the data into train and test in specified ratio
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 2417
Number of rows in test data = 1037
Let's build a linear regression model using statsmodels.
# adding constant to the train data
x_train = sm.add_constant(x_train)
# adding constant to the test data
x_test = sm.add_constant(x_test)  # add a constant to the test data
olsmodel1 = sm.OLS(y_train, x_train)  # define the OLS model
olsres = olsmodel1.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_log R-squared: 0.845
Model: OLS Adj. R-squared: 0.842
Method: Least Squares F-statistic: 268.8
Date: Thu, 09 Dec 2021 Prob (F-statistic): 0.00
Time: 20:17:09 Log-Likelihood: 124.22
No. Observations: 2417 AIC: -150.4
Df Residuals: 2368 BIC: 133.3
Df Model: 48
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -48.6977 9.184 -5.303 0.000 -66.707 -30.689
screen_size 0.0243 0.003 7.145 0.000 0.018 0.031
main_camera_mp 0.0203 0.001 13.806 0.000 0.017 0.023
selfie_camera_mp 0.0136 0.001 12.084 0.000 0.011 0.016
int_memory 0.0001 6.97e-05 1.542 0.123 -2.92e-05 0.000
ram 0.0239 0.005 4.657 0.000 0.014 0.034
battery -1.585e-05 7.27e-06 -2.181 0.029 -3.01e-05 -1.6e-06
weight 0.0010 0.000 7.421 0.000 0.001 0.001
release_year 0.0248 0.005 5.441 0.000 0.016 0.034
days_used 3.485e-05 3.09e-05 1.127 0.260 -2.58e-05 9.55e-05
new_price_log 0.4310 0.012 35.133 0.000 0.407 0.455
brand_name_Alcatel 0.0153 0.048 0.321 0.748 -0.078 0.109
brand_name_Apple -0.0116 0.147 -0.079 0.937 -0.300 0.277
brand_name_Asus 0.0195 0.048 0.408 0.683 -0.074 0.113
brand_name_BlackBerry -0.0295 0.070 -0.420 0.675 -0.167 0.108
brand_name_Celkon -0.0424 0.066 -0.640 0.522 -0.172 0.088
brand_name_Coolpad 0.0401 0.073 0.551 0.582 -0.103 0.183
brand_name_Gionee 0.0454 0.058 0.787 0.431 -0.068 0.159
brand_name_Google -0.0312 0.085 -0.369 0.712 -0.197 0.135
brand_name_HTC -0.0115 0.048 -0.240 0.811 -0.106 0.083
brand_name_Honor 0.0244 0.049 0.496 0.620 -0.072 0.121
brand_name_Huawei -0.0081 0.044 -0.181 0.856 -0.095 0.079
brand_name_Infinix 0.1548 0.093 1.661 0.097 -0.028 0.337
brand_name_Karbonn 0.0971 0.067 1.447 0.148 -0.034 0.229
brand_name_LG -0.0152 0.045 -0.335 0.738 -0.104 0.074
brand_name_Lava 0.0337 0.062 0.541 0.589 -0.089 0.156
brand_name_Lenovo 0.0449 0.045 0.994 0.320 -0.044 0.134
brand_name_Meizu 0.0080 0.056 0.143 0.887 -0.102 0.118
brand_name_Micromax -0.0335 0.048 -0.700 0.484 -0.127 0.060
brand_name_Microsoft 0.0945 0.088 1.070 0.285 -0.079 0.268
brand_name_Motorola 0.0045 0.050 0.091 0.928 -0.093 0.102
brand_name_Nokia 0.0671 0.052 1.297 0.195 -0.034 0.169
brand_name_OnePlus 0.1235 0.077 1.596 0.111 -0.028 0.275
brand_name_Oppo 0.0198 0.048 0.414 0.679 -0.074 0.113
brand_name_Others -0.0080 0.042 -0.191 0.849 -0.091 0.074
brand_name_Panasonic 0.0574 0.056 1.028 0.304 -0.052 0.167
brand_name_Realme 0.1197 0.061 1.951 0.051 -0.001 0.240
brand_name_Samsung -0.0324 0.043 -0.749 0.454 -0.117 0.052
brand_name_Sony -0.0493 0.050 -0.979 0.328 -0.148 0.049
brand_name_Spice -0.0132 0.063 -0.208 0.835 -0.137 0.111
brand_name_Vivo -0.0082 0.048 -0.170 0.865 -0.103 0.087
brand_name_XOLO 0.0102 0.055 0.187 0.852 -0.097 0.118
brand_name_Xiaomi 0.0978 0.048 2.034 0.042 0.004 0.192
brand_name_ZTE -0.0038 0.047 -0.079 0.937 -0.097 0.089
os_Others -0.0513 0.033 -1.566 0.117 -0.116 0.013
os_Windows -0.0176 0.045 -0.389 0.697 -0.106 0.071
os_iOS -0.0585 0.146 -0.399 0.690 -0.346 0.229
4g_yes 0.0507 0.016 3.190 0.001 0.020 0.082
5g_yes -0.0435 0.032 -1.369 0.171 -0.106 0.019
==============================================================================
Omnibus: 217.620 Durbin-Watson: 1.904
Prob(Omnibus): 0.000 Jarque-Bera (JB): 409.702
Skew: -0.607 Prob(JB): 1.08e-89
Kurtosis: 4.611 Cond. No. 7.69e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.69e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmodel_train_perf = model_performance_regression(olsres, x_train, y_train)
olsmodel_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.230 | 0.180 | 0.845 | 0.842 | 4.327 |
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmodel_test_perf = model_performance_regression(olsres, x_test, y_test)
olsmodel_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.238 | 0.184 | 0.843 | 0.835 | 4.488 |
Observations:
1. The R-squared did not drop drastically between train and test.
2. The training R² is 0.845, so the model is not underfitting.
3. The train and test RMSE and MAE are comparable, so the model is not overfitting either.
4. A MAPE of 4.49 on the test data means the predictions are, on average, within about 4.5% of the (log) used price.
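Since the target is log-transformed, predictions should be mapped back to the price scale with the exponential before being reported as prices; note that the metrics above are computed on the log scale, so errors on the price scale will differ. A sketch — in the notebook, the same back-transform would be applied to `olsres.predict(x_test)`:

```python
import numpy as np

# illustrative log-scale predictions (the first two used_price_log values seen earlier)
pred_log = np.array([4.308, 5.162])
pred_price = np.exp(pred_log)  # back-transform to the original price scale
print(np.round(pred_price, 2))
```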
Next, let's check the assumptions of the linear regression model:
1. No multicollinearity
2. Linearity of variables
3. Independence of error terms
4. Normality of error terms
5. No heteroscedasticity
Multicollinearity occurs when predictor variables in a regression model are correlated. This correlation is a problem because predictor variables should be independent. If the correlation between variables is high, it can cause problems when we fit the model and interpret the results. When we have multicollinearity in the linear model, the coefficients that the model suggests are unreliable.
There are different ways of detecting (or testing) multicollinearity. One such way is by using the Variance Inflation Factor, or VIF.
Variance Inflation Factor (VIF): Variance inflation factors measure the inflation in the variances of the regression parameter estimates due to collinearities among the predictors. For the k-th predictor, VIF_k = 1 / (1 − R²_k), where R²_k is the R-squared from regressing the k-th predictor on the remaining predictors; it measures how much the variance of the estimated regression coefficient β_k is "inflated" by the correlation among the predictor variables in the model.
If VIF is 1, there is no correlation between the k-th predictor and the remaining predictor variables, and hence the variance of β_k is not inflated at all.
General rule of thumb:
If VIF is between 1 and 5, there is low multicollinearity.
If VIF is between 5 and 10, there is moderate multicollinearity.
If VIF exceeds 10, there are signs of high multicollinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor
# we will define a function to check VIF
def checking_vif(predictors):
vif = pd.DataFrame()
vif["feature"] = predictors.columns
# calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(predictors.values, i)
for i in range(len(predictors.columns))
]
return vif
checking_vif(x_train)  # let's check the VIF of the training dataset
| feature | VIF | |
|---|---|---|
| 0 | const | 3780344.203 |
| 1 | screen_size | 7.681 |
| 2 | main_camera_mp | 2.137 |
| 3 | selfie_camera_mp | 2.808 |
| 4 | int_memory | 1.361 |
| 5 | ram | 2.258 |
| 6 | battery | 4.074 |
| 7 | weight | 6.381 |
| 8 | release_year | 4.885 |
| 9 | days_used | 2.669 |
| 10 | new_price_log | 3.122 |
| 11 | brand_name_Alcatel | 3.406 |
| 12 | brand_name_Apple | 13.057 |
| 13 | brand_name_Asus | 3.330 |
| 14 | brand_name_BlackBerry | 1.632 |
| 15 | brand_name_Celkon | 1.774 |
| 16 | brand_name_Coolpad | 1.467 |
| 17 | brand_name_Gionee | 1.951 |
| 18 | brand_name_Google | 1.322 |
| 19 | brand_name_HTC | 3.410 |
| 20 | brand_name_Honor | 3.346 |
| 21 | brand_name_Huawei | 5.986 |
| 22 | brand_name_Infinix | 1.284 |
| 23 | brand_name_Karbonn | 1.573 |
| 24 | brand_name_LG | 4.849 |
| 25 | brand_name_Lava | 1.711 |
| 26 | brand_name_Lenovo | 4.559 |
| 27 | brand_name_Meizu | 2.173 |
| 28 | brand_name_Micromax | 3.363 |
| 29 | brand_name_Microsoft | 1.869 |
| 30 | brand_name_Motorola | 3.260 |
| 31 | brand_name_Nokia | 3.472 |
| 32 | brand_name_OnePlus | 1.437 |
| 33 | brand_name_Oppo | 3.972 |
| 34 | brand_name_Others | 9.711 |
| 35 | brand_name_Panasonic | 2.105 |
| 36 | brand_name_Realme | 1.931 |
| 37 | brand_name_Samsung | 7.540 |
| 38 | brand_name_Sony | 2.932 |
| 39 | brand_name_Spice | 1.689 |
| 40 | brand_name_Vivo | 3.648 |
| 41 | brand_name_XOLO | 2.137 |
| 42 | brand_name_Xiaomi | 3.712 |
| 43 | brand_name_ZTE | 3.796 |
| 44 | os_Others | 1.855 |
| 45 | os_Windows | 1.595 |
| 46 | os_iOS | 11.785 |
| 47 | 4g_yes | 2.479 |
| 48 | 5g_yes | 1.845 |
Observations:
1. screen_size, weight, brand_name_Apple, brand_name_Huawei, brand_name_Others, brand_name_Samsung, and os_iOS have VIF greater than 5.
2. As the screen size increases, the weight of the device also increases; the two are highly correlated.
3. brand_name_Apple has the maximum VIF, followed by os_iOS; let's remove the high-VIF columns one by one and check the VIFs again.
def treating_multicollinearity(predictors, target, high_vif_columns):
"""
Checking the effect of dropping the columns showing high multicollinearity
on model performance (adj. R-squared and RMSE)
predictors: independent variables
target: dependent variable
high_vif_columns: columns having high VIF
"""
# empty lists to store adj. R-squared and RMSE values
adj_r2 = []
rmse = []
# build ols models by dropping one of the high VIF columns at a time
# store the adjusted R-squared and RMSE in the lists defined previously
for cols in high_vif_columns:
# defining the new train set
train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]
# create the model
olsmodel = sm.OLS(target, train).fit()
# adding adj. R-squared and RMSE to the lists
adj_r2.append(olsmodel.rsquared_adj)
rmse.append(np.sqrt(olsmodel.mse_resid))
# creating a dataframe for the results
temp = pd.DataFrame(
{
"col": high_vif_columns,
"Adj. R-squared after_dropping col": adj_r2,
"RMSE after dropping col": rmse,
}
).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
temp.reset_index(drop=True, inplace=True)
return temp
#create a list with numerical variables having VIF>5
col_list = [
"screen_size",
"weight",
"brand_name_Apple",
"brand_name_Huawei",
"brand_name_Others",
"brand_name_Samsung",
"os_iOS"
]
res = treating_multicollinearity(x_train, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | brand_name_Apple | 0.842 | 0.232 |
| 1 | brand_name_Huawei | 0.842 | 0.232 |
| 2 | brand_name_Others | 0.842 | 0.232 |
| 3 | os_iOS | 0.842 | 0.232 |
| 4 | brand_name_Samsung | 0.842 | 0.232 |
| 5 | screen_size | 0.838 | 0.235 |
| 6 | weight | 0.838 | 0.235 |
Observations:
1. Dropping screen_size or weight would hurt model performance (adj. R-squared falls from 0.842 to 0.838).
2. Let's drop brand_name_Apple, since it has the maximum VIF, and check the VIFs again.
col_to_drop = "brand_name_Apple"  # remove brand_name_Apple
x_train2 = x_train.loc[:, ~x_train.columns.str.startswith(col_to_drop)]
x_test2 = x_test.loc[:, ~x_test.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train2)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping brand_name_Apple
| feature | VIF | |
|---|---|---|
| 0 | const | 3773730.652 |
| 1 | screen_size | 7.647 |
| 2 | main_camera_mp | 2.136 |
| 3 | selfie_camera_mp | 2.787 |
| 4 | int_memory | 1.361 |
| 5 | ram | 2.250 |
| 6 | battery | 4.072 |
| 7 | weight | 6.378 |
| 8 | release_year | 4.876 |
| 9 | days_used | 2.669 |
| 10 | new_price_log | 3.106 |
| 11 | brand_name_Alcatel | 3.231 |
| 12 | brand_name_Asus | 3.143 |
| 13 | brand_name_BlackBerry | 1.561 |
| 14 | brand_name_Celkon | 1.731 |
| 15 | brand_name_Coolpad | 1.435 |
| 16 | brand_name_Gionee | 1.886 |
| 17 | brand_name_Google | 1.293 |
| 18 | brand_name_HTC | 3.240 |
| 19 | brand_name_Honor | 3.164 |
| 20 | brand_name_Huawei | 5.583 |
| 21 | brand_name_Infinix | 1.265 |
| 22 | brand_name_Karbonn | 1.544 |
| 23 | brand_name_LG | 4.564 |
| 24 | brand_name_Lava | 1.671 |
| 25 | brand_name_Lenovo | 4.292 |
| 26 | brand_name_Meizu | 2.086 |
| 27 | brand_name_Micromax | 3.214 |
| 28 | brand_name_Microsoft | 1.835 |
| 29 | brand_name_Motorola | 3.095 |
| 30 | brand_name_Nokia | 3.257 |
| 31 | brand_name_OnePlus | 1.402 |
| 32 | brand_name_Oppo | 3.764 |
| 33 | brand_name_Others | 9.074 |
| 34 | brand_name_Panasonic | 2.031 |
| 35 | brand_name_Realme | 1.864 |
| 36 | brand_name_Samsung | 6.991 |
| 37 | brand_name_Sony | 2.787 |
| 38 | brand_name_Spice | 1.655 |
| 39 | brand_name_Vivo | 3.444 |
| 40 | brand_name_XOLO | 2.068 |
| 41 | brand_name_Xiaomi | 3.505 |
| 42 | brand_name_ZTE | 3.602 |
| 43 | os_Others | 1.736 |
| 44 | os_Windows | 1.593 |
| 45 | os_iOS | 1.909 |
| 46 | 4g_yes | 2.478 |
| 47 | 5g_yes | 1.837 |
After dropping brand_name_Apple, the VIF of os_iOS reduced (to 1.91), so it will not be dropped from the model. Let's drop brand_name_Others next.
col_list = [
"screen_size",
"weight",
"brand_name_Huawei",
"brand_name_Others",
"brand_name_Samsung",
"os_iOS"
]
res = treating_multicollinearity(x_train2, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | brand_name_Huawei | 0.842 | 0.232 |
| 1 | brand_name_Others | 0.842 | 0.232 |
| 2 | brand_name_Samsung | 0.842 | 0.232 |
| 3 | os_iOS | 0.842 | 0.232 |
| 4 | screen_size | 0.839 | 0.235 |
| 5 | weight | 0.838 | 0.235 |
Let's drop brand_name_Others, as it has the highest VIF among the remaining columns.
col_to_drop = "brand_name_Others"  # remove brand_name_Others
x_train3 = x_train2.loc[:, ~x_train2.columns.str.startswith(col_to_drop)]
x_test3 = x_test2.loc[:, ~x_test2.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train3)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping brand_name_Others
| | feature | VIF |
|---|---|---|
| 0 | const | 3773582.100 |
| 1 | screen_size | 7.574 |
| 2 | main_camera_mp | 2.134 |
| 3 | selfie_camera_mp | 2.785 |
| 4 | int_memory | 1.361 |
| 5 | ram | 2.247 |
| 6 | battery | 4.071 |
| 7 | weight | 6.353 |
| 8 | release_year | 4.875 |
| 9 | days_used | 2.669 |
| 10 | new_price_log | 3.104 |
| 11 | brand_name_Alcatel | 1.218 |
| 12 | brand_name_Asus | 1.200 |
| 13 | brand_name_BlackBerry | 1.124 |
| 14 | brand_name_Celkon | 1.174 |
| 15 | brand_name_Coolpad | 1.053 |
| 16 | brand_name_Gionee | 1.091 |
| 17 | brand_name_Google | 1.045 |
| 18 | brand_name_HTC | 1.221 |
| 19 | brand_name_Honor | 1.285 |
| 20 | brand_name_Huawei | 1.500 |
| 21 | brand_name_Infinix | 1.062 |
| 22 | brand_name_Karbonn | 1.073 |
| 23 | brand_name_LG | 1.343 |
| 24 | brand_name_Lava | 1.071 |
| 25 | brand_name_Lenovo | 1.298 |
| 26 | brand_name_Meizu | 1.130 |
| 27 | brand_name_Micromax | 1.228 |
| 28 | brand_name_Microsoft | 1.494 |
| 29 | brand_name_Motorola | 1.245 |
| 30 | brand_name_Nokia | 1.483 |
| 31 | brand_name_OnePlus | 1.081 |
| 32 | brand_name_Oppo | 1.374 |
| 33 | brand_name_Panasonic | 1.106 |
| 34 | brand_name_Realme | 1.159 |
| 35 | brand_name_Samsung | 1.576 |
| 36 | brand_name_Sony | 1.203 |
| 37 | brand_name_Spice | 1.081 |
| 38 | brand_name_Vivo | 1.322 |
| 39 | brand_name_XOLO | 1.115 |
| 40 | brand_name_Xiaomi | 1.309 |
| 41 | brand_name_ZTE | 1.262 |
| 42 | os_Others | 1.724 |
| 43 | os_Windows | 1.592 |
| 44 | os_iOS | 1.177 |
| 45 | 4g_yes | 2.473 |
| 46 | 5g_yes | 1.836 |
Observation: After dropping brand_name_Others, only screen_size and weight have VIF > 5.
Let's drop screen_size next and check.
col_list = [
"screen_size",
"weight",
"brand_name_Huawei",
"brand_name_Samsung",
"os_iOS"
]
res = treating_multicollinearity(x_train3, y_train, col_list)
res
| | col | Adj. R-squared after dropping col | RMSE after dropping col |
|---|---|---|---|
| 0 | brand_name_Huawei | 0.842 | 0.232 |
| 1 | brand_name_Samsung | 0.842 | 0.232 |
| 2 | os_iOS | 0.842 | 0.232 |
| 3 | screen_size | 0.839 | 0.235 |
| 4 | weight | 0.838 | 0.235 |
col_to_drop = "screen_size"  # dropping screen_size
x_train4 = x_train3.loc[:, ~x_train3.columns.str.startswith(col_to_drop)]
x_test4 = x_test3.loc[:, ~x_test3.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train4)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping screen_size
| | feature | VIF |
|---|---|---|
| 0 | const | 3642422.171 |
| 1 | main_camera_mp | 2.132 |
| 2 | selfie_camera_mp | 2.783 |
| 3 | int_memory | 1.358 |
| 4 | ram | 2.247 |
| 5 | battery | 3.826 |
| 6 | weight | 2.974 |
| 7 | release_year | 4.702 |
| 8 | days_used | 2.656 |
| 9 | new_price_log | 3.057 |
| 10 | brand_name_Alcatel | 1.202 |
| 11 | brand_name_Asus | 1.198 |
| 12 | brand_name_BlackBerry | 1.124 |
| 13 | brand_name_Celkon | 1.172 |
| 14 | brand_name_Coolpad | 1.052 |
| 15 | brand_name_Gionee | 1.090 |
| 16 | brand_name_Google | 1.045 |
| 17 | brand_name_HTC | 1.221 |
| 18 | brand_name_Honor | 1.269 |
| 19 | brand_name_Huawei | 1.490 |
| 20 | brand_name_Infinix | 1.061 |
| 21 | brand_name_Karbonn | 1.068 |
| 22 | brand_name_LG | 1.343 |
| 23 | brand_name_Lava | 1.069 |
| 24 | brand_name_Lenovo | 1.294 |
| 25 | brand_name_Meizu | 1.130 |
| 26 | brand_name_Micromax | 1.226 |
| 27 | brand_name_Microsoft | 1.494 |
| 28 | brand_name_Motorola | 1.245 |
| 29 | brand_name_Nokia | 1.482 |
| 30 | brand_name_OnePlus | 1.080 |
| 31 | brand_name_Oppo | 1.372 |
| 32 | brand_name_Panasonic | 1.104 |
| 33 | brand_name_Realme | 1.159 |
| 34 | brand_name_Samsung | 1.573 |
| 35 | brand_name_Sony | 1.202 |
| 36 | brand_name_Spice | 1.081 |
| 37 | brand_name_Vivo | 1.315 |
| 38 | brand_name_XOLO | 1.113 |
| 39 | brand_name_Xiaomi | 1.307 |
| 40 | brand_name_ZTE | 1.261 |
| 41 | os_Others | 1.517 |
| 42 | os_Windows | 1.592 |
| 43 | os_iOS | 1.176 |
| 44 | 4g_yes | 2.472 |
| 45 | 5g_yes | 1.831 |
Observations: The VIFs of all the feature variables are now under 5, so the multicollinearity has been handled.
olsmod1 = sm.OLS(y_train, x_train4).fit()  # fitting an OLS model on the reduced feature set
print(olsmod1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_log R-squared: 0.842
Model: OLS Adj. R-squared: 0.839
Method: Least Squares F-statistic: 279.8
Date: Thu, 09 Dec 2021 Prob (F-statistic): 0.00
Time: 18:55:22 Log-Likelihood: 97.980
No. Observations: 2417 AIC: -104.0
Df Residuals: 2371 BIC: 162.4
Df Model: 45
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -60.9864 9.107 -6.696 0.000 -78.845 -43.127
main_camera_mp 0.0207 0.001 13.935 0.000 0.018 0.024
selfie_camera_mp 0.0138 0.001 12.212 0.000 0.012 0.016
int_memory 8.634e-05 7.03e-05 1.228 0.220 -5.15e-05 0.000
ram 0.0243 0.005 4.686 0.000 0.014 0.034
battery -2.991e-06 7.12e-06 -0.420 0.674 -1.69e-05 1.1e-05
weight 0.0017 9.19e-05 18.347 0.000 0.002 0.002
release_year 0.0309 0.005 6.852 0.000 0.022 0.040
days_used 1.989e-05 3.12e-05 0.639 0.523 -4.12e-05 8.1e-05
new_price_log 0.4419 0.012 36.034 0.000 0.418 0.466
brand_name_Alcatel 0.0457 0.029 1.601 0.110 -0.010 0.102
brand_name_Asus 0.0336 0.029 1.156 0.248 -0.023 0.090
brand_name_BlackBerry -0.0177 0.059 -0.301 0.763 -0.133 0.098
brand_name_Celkon -0.0185 0.054 -0.341 0.733 -0.125 0.088
brand_name_Coolpad 0.0619 0.062 0.993 0.321 -0.060 0.184
brand_name_Gionee 0.0441 0.044 1.012 0.312 -0.041 0.130
brand_name_Google -0.0301 0.076 -0.396 0.692 -0.179 0.119
brand_name_HTC -0.0023 0.029 -0.079 0.937 -0.059 0.055
brand_name_Honor 0.0560 0.031 1.828 0.068 -0.004 0.116
brand_name_Huawei 0.0128 0.022 0.570 0.569 -0.031 0.057
brand_name_Infinix 0.1709 0.086 1.996 0.046 0.003 0.339
brand_name_Karbonn 0.1315 0.056 2.354 0.019 0.022 0.241
brand_name_LG -0.0065 0.024 -0.270 0.787 -0.054 0.041
brand_name_Lava 0.0569 0.050 1.143 0.253 -0.041 0.154
brand_name_Lenovo 0.0620 0.024 2.547 0.011 0.014 0.110
brand_name_Meizu 0.0214 0.041 0.525 0.599 -0.059 0.101
brand_name_Micromax -0.0182 0.029 -0.623 0.533 -0.075 0.039
brand_name_Microsoft 0.1050 0.080 1.316 0.188 -0.051 0.261
brand_name_Motorola 0.0114 0.031 0.370 0.711 -0.049 0.072
brand_name_Nokia 0.0699 0.034 2.046 0.041 0.003 0.137
brand_name_OnePlus 0.1496 0.068 2.206 0.027 0.017 0.283
brand_name_Oppo 0.0352 0.028 1.241 0.215 -0.020 0.091
brand_name_Panasonic 0.0778 0.041 1.906 0.057 -0.002 0.158
brand_name_Realme 0.1324 0.048 2.757 0.006 0.038 0.227
brand_name_Samsung -0.0193 0.020 -0.966 0.334 -0.058 0.020
brand_name_Sony -0.0370 0.033 -1.135 0.256 -0.101 0.027
brand_name_Spice -0.0104 0.051 -0.204 0.838 -0.111 0.090
brand_name_Vivo 0.0148 0.029 0.505 0.613 -0.043 0.072
brand_name_XOLO 0.0303 0.040 0.759 0.448 -0.048 0.109
brand_name_Xiaomi 0.1126 0.029 3.907 0.000 0.056 0.169
brand_name_ZTE 0.0086 0.028 0.312 0.755 -0.045 0.063
os_Others -0.1304 0.030 -4.354 0.000 -0.189 -0.072
os_Windows -0.0157 0.046 -0.345 0.730 -0.105 0.074
os_iOS -0.0729 0.047 -1.560 0.119 -0.165 0.019
4g_yes 0.0481 0.016 3.000 0.003 0.017 0.080
5g_yes -0.0549 0.032 -1.717 0.086 -0.118 0.008
==============================================================================
Omnibus: 228.311 Durbin-Watson: 1.901
Prob(Omnibus): 0.000 Jarque-Bera (JB): 425.216
Skew: -0.635 Prob(JB): 4.63e-93
Kurtosis: 4.615 Cond. No. 7.54e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.54e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Adjusted R-squared: It reflects the fit of the model.
Adjusted R-squared values generally range from 0 to 1, where a higher value generally indicates a better fit,
assuming certain conditions are met.
In our case, the value for adj. R-squared is 0.839, which is good!
const coefficient: It is the Y-intercept.
It means that if all the predictor variable coefficients are zero, then the expected output (i.e., Y) would be equal to the const coefficient.
In our case, the value of the const coefficient is -60.99.
std err: It reflects the level of accuracy of the coefficient estimates; for the const term it is 9.107.
The lower the standard error, the higher the level of accuracy.
P>|t|: It is the p-value.
For each independent feature, there is a null hypothesis and an alternate hypothesis, where 𝛽𝑖 is the coefficient of the 𝑖th independent variable:
𝐻𝑜: The independent feature is not significant (𝛽𝑖 = 0)
𝐻𝑎: The independent feature is significant (𝛽𝑖 ≠ 0)
(P>|t|) gives the p-value for each independent feature to check that null hypothesis. We are considering 0.05 (5%) as significance level.
A p-value of less than 0.05 is considered to be statistically significant.
Confidence Interval: It represents the range in which our coefficients are likely to fall (with a likelihood of 95%).
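As a quick sanity check, the adjusted R-squared reported in the summary can be reproduced from the standard formula using the reported R-squared, number of observations, and number of predictors:

```python
# Adjusted R-squared from its formula:
#   adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
# n = number of observations, k = number of predictors (Df Model in the summary)
n, k, r2 = 2417, 45, 0.842
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 3))  # 0.839, matching the summary
```

The penalty term (n - 1) / (n - k - 1) is why adding weak predictors can raise R-squared while leaving the adjusted R-squared flat or lower.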
Observations:
1. We can see that the adj. R-squared has dropped only slightly, from 0.842 to 0.839, which shows that the dropped columns did not have much effect on the model.
As there is no multicollinearity, we can look at the p-values of predictor variables to check their significance.
# initial list of columns
cols= x_train4.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
    # defining the train set
    x_train_aux = x_train4[cols]
    # fitting the model
    model = sm.OLS(y_train, x_train_aux).fit()
    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)
    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()
    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features = cols
print(selected_features)
['const', 'main_camera_mp', 'selfie_camera_mp', 'ram', 'weight', 'release_year', 'new_price_log', 'brand_name_Karbonn', 'brand_name_Lenovo', 'brand_name_OnePlus', 'brand_name_Realme', 'brand_name_Xiaomi', 'os_Others', '4g_yes']
x_train5 = x_train4[selected_features]
x_test5 = x_test4[selected_features]
olsmod2 = sm.OLS(y_train, x_train5).fit()
print(olsmod2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_log R-squared: 0.839
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 964.3
Date: Thu, 09 Dec 2021 Prob (F-statistic): 0.00
Time: 19:10:48 Log-Likelihood: 79.905
No. Observations: 2417 AIC: -131.8
Df Residuals: 2403 BIC: -50.75
Df Model: 13
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
const -59.7099 6.737 -8.862 0.000 -72.922 -46.498
main_camera_mp 0.0211 0.001 15.550 0.000 0.018 0.024
selfie_camera_mp 0.0144 0.001 13.525 0.000 0.012 0.017
ram 0.0192 0.004 4.298 0.000 0.010 0.028
weight 0.0016 6.02e-05 27.334 0.000 0.002 0.002
release_year 0.0303 0.003 9.071 0.000 0.024 0.037
new_price_log 0.4315 0.011 39.883 0.000 0.410 0.453
brand_name_Karbonn 0.1210 0.055 2.210 0.027 0.014 0.228
brand_name_Lenovo 0.0513 0.022 2.371 0.018 0.009 0.094
brand_name_OnePlus 0.1361 0.066 2.055 0.040 0.006 0.266
brand_name_Realme 0.1005 0.046 2.199 0.028 0.011 0.190
brand_name_Xiaomi 0.0976 0.026 3.802 0.000 0.047 0.148
os_Others -0.1356 0.027 -5.025 0.000 -0.189 -0.083
4g_yes 0.0457 0.015 3.041 0.002 0.016 0.075
==============================================================================
Omnibus: 245.549 Durbin-Watson: 1.908
Prob(Omnibus): 0.000 Jarque-Bera (JB): 480.492
Skew: -0.658 Prob(JB): 4.60e-105
Kurtosis: 4.743 Cond. No. 2.86e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.86e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Observations:
1. Now there are no variables with a p-value > 0.05.
2. The performance of the model has also not changed much, but the standard errors have reduced.
3. The variables we dropped were not affecting the model much.
We will now check the following assumptions of the linear regression model:
1. Linearity of variables
2. Independence of error terms
3. Normality of error terms
4. No heteroscedasticity
Why the test?
Linearity describes a straight-line relationship between two variables; the predictor variables must have a linear relation with the dependent variable.
The independence of the error terms (or residuals) is important. If the residuals are not independent, then the confidence intervals of the coefficient estimates will be narrower and make us incorrectly conclude a parameter to be statistically significant.
How to check linearity and independence?
Make a plot of fitted values vs residuals.
If they don't follow any pattern, then we say the model is linear and residuals are independent.
Otherwise, the model is showing signs of non-linearity and residuals are not independent.
How to fix if this assumption is not followed?
We can try to transform the variables and make the relationships linear.
# let us create a dataframe with actual, fitted and residual values
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train # actual values
df_pred["Fitted Values"] = olsmod2.fittedvalues # predicted values
df_pred["Residuals"] = olsmod2.resid # residuals
df_pred.head()
| | Actual Values | Fitted Values | Residuals |
|---|---|---|---|
| 3026 | 4.087 | 3.860 | 0.228 |
| 1525 | 4.448 | 4.639 | -0.191 |
| 1128 | 4.315 | 4.280 | 0.035 |
| 3003 | 4.282 | 4.239 | 0.043 |
| 2907 | 4.456 | 4.489 | -0.033 |
# let's plot the fitted values vs residuals
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
Observation: We see no pattern in the fitted vs residuals plot. Hence the assumptions of linearity and independence are satisfied.
Why the test?
Error terms, or residuals, should be normally distributed. If they are not, the confidence intervals of the coefficient estimates may become too wide or too narrow, and once the confidence intervals become unstable, it is difficult to estimate coefficients based on the minimization of least squares. Non-normality suggests that there are a few unusual data points that must be studied closely to make a better model.
How to check normality?
The shape of the histogram of residuals can give an initial idea about the normality.
It can also be checked via a Q-Q plot of residuals. If the residuals follow a normal distribution, they will make a straight line plot, otherwise not.
Other tests to check for normality include the Shapiro-Wilk test.
Null hypothesis: Residuals are normally distributed
Alternate hypothesis: Residuals are not normally distributed
How to fix if this assumption is not followed?
We can apply transformations like log, exponential, arcsinh, etc. as per our data.
sns.histplot(data=df_pred, x="Residuals", kde=True)
plt.title("Normality of residuals")
plt.show()
import pylab
import scipy.stats as stats
stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab)
plt.show()
The residuals more or less follow a straight line except for the tails.
Let's check the results of the Shapiro-Wilk test.
stats.shapiro(df_pred["Residuals"])
ShapiroResult(statistic=0.9678751826286316, pvalue=8.044154752235425e-23)
Since p-value < 0.05, the residuals are not normal as per the Shapiro-Wilk test.
Strictly speaking, the residuals are not normal.
However, as an approximation, we can accept this distribution as close to being normal.
So, the assumption is satisfied.
Homoscedasticity: If the variance of the residuals is symmetrically distributed across the regression line, then the data is said to be homoscedastic.
Heteroscedasticity: If the variance of the residuals is unequal across the regression line, then the data is said to be heteroscedastic.
Why the test?
The presence of non-constant variance in the error terms results in heteroscedasticity. Generally, non-constant variance arises in presence of outliers.
How to check for homoscedasticity?
The residual vs fitted values plot can be looked at to check for homoscedasticity. In the case of heteroscedasticity, the residuals can form an arrow shape or any other non-symmetrical shape.
The Goldfeld-Quandt test can also be used. If we get a p-value > 0.05, we can say that the residuals are homoscedastic; otherwise, they are heteroscedastic.
Null hypothesis: Residuals are homoscedastic
Alternate hypothesis: Residuals have heteroscedasticity
How to fix if this assumption is not followed?
Heteroscedasticity can be fixed by adding other important features or making transformations.
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train4)
lzip(name, test)
[('F statistic', 1.0216240134000527), ('p-value', 0.35769077877709526)]
Since p-value > 0.05, we can say that the residuals are homoscedastic. So, this assumption is satisfied.
Now that we have checked all the assumptions of linear regression and they are satisfied, let's go ahead with prediction.
# predictions on the test set
pred = olsmod2.predict(x_test5)
df_pred_test = pd.DataFrame({"Actual": y_test, "Predicted": pred})
df_pred_test.sample(10, random_state=1)
| | Actual | Predicted |
|---|---|---|
| 1995 | 4.567 | 4.377 |
| 2341 | 3.696 | 3.996 |
| 1913 | 3.592 | 3.638 |
| 688 | 4.306 | 4.090 |
| 650 | 4.522 | 5.177 |
| 2291 | 4.259 | 4.386 |
| 40 | 4.998 | 5.437 |
| 1884 | 3.875 | 4.045 |
| 2538 | 4.207 | 4.060 |
| 45 | 5.380 | 5.278 |
df1 = df_pred_test.sample(25, random_state=1)
df1.plot(kind="bar", figsize=(15, 7))
plt.show()
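Note that both columns above are on the log scale, since the model predicts used_price_log. Assuming the natural log was used to create that column earlier in the notebook, the predictions can be brought back to the price scale with `np.exp`; a minimal sketch using the first actual/predicted pair from the sample above:

```python
import numpy as np
import pandas as pd

# Predictions are on the log scale; exponentiating recovers the price scale.
# (Assumes used_price_log was created with np.log earlier in the notebook.)
df_log = pd.DataFrame({"Actual": [4.567], "Predicted": [4.377]})
df_price = np.exp(df_log)
print(df_price.round(2))
```

The same transformation applied to the full `df_pred_test` would give actual and predicted prices in the original currency units.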
Observations: The model has done pretty well; the difference between actual and predicted values is small.
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmod2_train_perf = model_performance_regression(olsmod2, x_train5, y_train)
olsmod2_train_perf
Training Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.234 | 0.183 | 0.839 | 0.838 | 4.410 |
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmod2_test_perf = model_performance_regression(olsmod2, x_test5, y_test)
olsmod2_test_perf
Test Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.242 | 0.187 | 0.838 | 0.836 | 4.568 |
# training performance comparison
models_train_comp_df = pd.concat(
[olsmodel_train_perf.T, olsmod2_train_perf.T],
axis=1,
)
models_train_comp_df.columns = [
"Linear Regression (all variables)",
"Linear Regression (selected variables)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Linear Regression (all variables) | Linear Regression (selected variables) |
|---|---|---|
| RMSE | 0.230 | 0.234 |
| MAE | 0.180 | 0.183 |
| R-squared | 0.845 | 0.839 |
| Adj. R-squared | 0.842 | 0.838 |
| MAPE | 4.327 | 4.410 |
# test performance comparison
models_test_comp_df = pd.concat(
[olsmodel_test_perf.T, olsmod2_test_perf.T],
axis=1,
)
models_test_comp_df.columns = [
"Linear Regression (all variables)",
"Linear Regression (selected variables)",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Linear Regression (all variables) | Linear Regression (selected variables) |
|---|---|---|
| RMSE | 0.238 | 0.242 |
| MAE | 0.184 | 0.187 |
| R-squared | 0.843 | 0.838 |
| Adj. R-squared | 0.835 | 0.836 |
| MAPE | 4.488 | 4.568 |
olsmodel_final = sm.OLS(y_train, x_train5).fit()  # refitting the final model and checking its summary
print(olsmodel_final.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_log R-squared: 0.839
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 964.3
Date: Thu, 09 Dec 2021 Prob (F-statistic): 0.00
Time: 19:11:45 Log-Likelihood: 79.905
No. Observations: 2417 AIC: -131.8
Df Residuals: 2403 BIC: -50.75
Df Model: 13
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
const -59.7099 6.737 -8.862 0.000 -72.922 -46.498
main_camera_mp 0.0211 0.001 15.550 0.000 0.018 0.024
selfie_camera_mp 0.0144 0.001 13.525 0.000 0.012 0.017
ram 0.0192 0.004 4.298 0.000 0.010 0.028
weight 0.0016 6.02e-05 27.334 0.000 0.002 0.002
release_year 0.0303 0.003 9.071 0.000 0.024 0.037
new_price_log 0.4315 0.011 39.883 0.000 0.410 0.453
brand_name_Karbonn 0.1210 0.055 2.210 0.027 0.014 0.228
brand_name_Lenovo 0.0513 0.022 2.371 0.018 0.009 0.094
brand_name_OnePlus 0.1361 0.066 2.055 0.040 0.006 0.266
brand_name_Realme 0.1005 0.046 2.199 0.028 0.011 0.190
brand_name_Xiaomi 0.0976 0.026 3.802 0.000 0.047 0.148
os_Others -0.1356 0.027 -5.025 0.000 -0.189 -0.083
4g_yes 0.0457 0.015 3.041 0.002 0.016 0.075
==============================================================================
Omnibus: 245.549 Durbin-Watson: 1.908
Prob(Omnibus): 0.000 Jarque-Bera (JB): 480.492
Skew: -0.658 Prob(JB): 4.60e-105
Kurtosis: 4.743 Cond. No. 2.86e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.86e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
OBJECTIVE AND DATA ANALYSIS STEPS.
1. The data (used device data) contains different attributes of used/refurbished phones and tablets.
2. We want to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify the factors that significantly influence it.
3. The data set was read into a Jupyter notebook and the necessary initial analysis was done, such as finding the shape of the data and its statistical description.
4. Missing values were treated, necessary feature engineering was done, and outliers were detected.
5. Finally, we built a linear regression model using OLS (ordinary least squares).
INSIGHTS
The model is able to explain ~84% of the variation in the data, which is very good.
The train and test RMSE and MAE are low and comparable. So, our model is not suffering from overfitting.
The MAPE on the test set suggests we can predict within ~4.57% of the used price.
Hence, we can conclude the model olsmodel_final is good for prediction as well as inference purposes.
RECOMMENDATIONS
1. The demand for used devices is definitely increasing from year to year.
2. There is huge demand for all types of used devices, whether tablets or mobile phones.
Detailed analysis of the data set shows that there is a wide range of devices with different specifications.
There are many options available in the used market, suitable for everyone and for every budget.
People have many options to choose from among their favorite brands.
- Recently released used devices usually cost more.
- Devices with an operating system other than Android or iOS may be in lower demand and cost less.
- Better camera quality and screen size also have a strong positive impact on the used device market, as they are in high demand.
- Devices from well-known brands also do well in the used device market.
- Devices which support both 4G and 5G cost more.
- Internal memory and RAM capacity are also key specifications: the more capacity, the better the price.
The COVID-19 outbreak has had a positive impact on the used device market, as people buy phones and tablets only for immediate needs.
Refurbishing used devices, providing insurance, and offering attractive deals may well attract more customers.
If devices with better screens, cameras, storage, and operating systems from different brands are available at reasonable prices compared to new devices, then the demand for used devices will increase year by year.